| Variable Name | Explanation |
|---|---|
| ID | Number corresponding to the precise combination of the features of the model |
| Model Year | Year of the model of the car |
| Make | Brand of the car |
| Model | The model of the car |
| Estimated Annual Petroleum Consumption (Barrels) | Consumption in Petroleum Barrels |
| Fuel Type 1 | First fuel energy source, the only source if the car is not a hybrid |
| City MPG (Fuel Type 1) | Fuel consumption of the car, in miles per gallon, when driving in a city, for fuel type 1 |
| Highway MPG (Fuel Type 1) | Fuel consumption of the car, in miles per gallon, when driving on a highway, for fuel type 1 |
| Combined MPG (Fuel Type 1) | Combined city and highway fuel consumption, in miles per gallon, for fuel type 1 |
| Fuel Type 2 | Second energy source, for hybrid cars |
| City MPG (Fuel Type 2) | Fuel consumption of the car, in miles per gallon, when driving in a city, for fuel type 2 |
| Highway MPG (Fuel Type 2) | Fuel consumption of the car, in miles per gallon, when driving on a highway, for fuel type 2 |
| Combined MPG (Fuel Type 2) | Combined city and highway fuel consumption, in miles per gallon, for fuel type 2 |
| Engine Cylinders | Number of cylinders of the car |
| Engine Displacement | Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers |
| Drive | Drivetrain type of the car (e.g. front-wheel, rear-wheel or all-wheel drive) |
| Engine Description | Description of the car, e.g. Turbo, Stop-Start, ... |
| Transmission | Manual/Automatic transmission, with number of gears and/or model of transmission |
| Vehicle Class | e.g. Minivan, Trucks, Midsize,.... |
| Time to Charge EV (hours at 120v) | Charging time of an electric vehicle, in hours, on a 120-volt outlet |
| Time to Charge EV (hours at 240v) | Charging time of an electric vehicle, in hours, on a 240-volt outlet |
| Range (for EV) | Total driving range of an electric vehicle |
| City Range (for EV - Fuel Type 1) | Driving range of an electric vehicle in the city, for fuel type 1 |
| City Range (for EV - Fuel Type 2) | Driving range of an electric vehicle in the city, for fuel type 2 |
| Hwy Range (for EV - Fuel Type 1) | Driving range of an electric vehicle on a highway, for fuel type 1 |
| Hwy Range (for EV - Fuel Type 2) | Driving range of an electric vehicle on a highway, for fuel type 2 |
1 Introduction
• The context and background: course, company name, business context.
During the first year of our Master in Management (orientation Business Analytics), we had the opportunity to attend the lectures of Machine Learning for Business Analytics. In this class, we saw multiple machine learning techniques for business contexts, mainly covering supervised methods (regressions, trees, support vector machines, neural networks) and unsupervised methods (clustering, PCA, FAMD, auto-encoders), but also other topics such as data splitting, ensemble methods and evaluation metrics.
• Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.
In the context of this class, our group had the opportunity to work on an applied project. Starting from scratch, we had to look for a potential dataset on which to apply, to real cases, what we had learned in class. We found an interesting dataset concerning vehicle MPG, range, engine statistics and more, for more than 100 brands. The goal of our research is to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, ...) using a model that we train (random forest, artificial neural network or decision trees). As some cars can share several identical characteristics while differing on various others, we thought it would be pertinent to build a model able to predict a car's brand from its features.
• Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).
The csv dataset was found on data.world, a data catalog platform that gathers various open-access datasets online. The file contains more than 45'000 rows and 26 columns, each column corresponding to one feature (such as the model year, the model, the estimated annual petroleum consumption in barrels, the highway MPG per fuel type and so on).
• The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.
Based on these columns, we had to find a machine learning model that could help us quantify the importance of the features in predicting the make of a car. Various models will be tried, both supervised and unsupervised.
• An outlook: a short paragraph indicating what will be treated in each of the following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”
From now on, we will go through the different sections. Section 2 is dedicated to a more in-depth description of the data, mentioning the variables and features, the instances, the types of data and some missing-data patterns. The next section covers the Exploratory Data Analysis (EDA), where visualisations are made in order to better perceive patterns in the variables as well as potential correlations. Section 4 is about the methods, first supervised and then unsupervised, in order to find a suitable model for our project. The results are discussed right after, and we close with a conclusion, recommendations and a discussion. Finally, the references and the appendix can be found at the end of the report.
2 Data description
- Description of the data file format (xlsx, csv, text, video, etc.) DONE
- The features or variables: type, units, the range (e.g. the time, numerical, in weeks from January 1, 2012 to December 31, 2015), their coding (numerical, the levels for categorical, etc.), etc. TABLE-NTBF
- The instances: customers, company, products, subjects, etc. DONE
- Missing data pattern: if there are missing data, if they are specific to some features, etc. NTBD
- Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc. NTBD
- If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications. NTBD
For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. You can access the dataset via the following link: data.world. It includes a total of 26 features describing 45,896 vehicle models released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions. A deeper description of the data can be found in the Annex.
2.0.1 The features or variables: type, units,…
2.1 The instances: customers, company, products, subjects, etc.
Each row concerns one car. We find, in order, the ID of the car, corresponding to a precise combination of features, followed by the features as listed in the table above.
2.2 Missing data pattern: if there are missing data, if they are specific to some features, etc.
2.3 Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc.
2.4 If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications.
EDA:
Columns description
To begin our EDA, let's have a look at our dataset and in particular at the characteristics of the columns.
Show the code
#to get a detailed summary
skim(data)

| | |
|---|---|
| Name | data |
| Number of rows | 45896 |
| Number of columns | 26 |
| Column type frequency: | |
| character | 8 |
| numeric | 18 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| make | 0 | 1.00 | 3 | 34 | 0 | 141 | 0 |
| model | 0 | 1.00 | 1 | 47 | 0 | 4762 | 0 |
| fuel_type_1 | 0 | 1.00 | 6 | 17 | 0 | 6 | 0 |
| fuel_type_2 | 44059 | 0.04 | 3 | 11 | 0 | 4 | 0 |
| drive | 1186 | 0.97 | 13 | 26 | 0 | 7 | 0 |
| engine_Description | 17031 | 0.63 | 1 | 46 | 0 | 589 | 0 |
| transmission | 11 | 1.00 | 12 | 32 | 0 | 40 | 0 |
| vehicle_class | 0 | 1.00 | 4 | 34 | 0 | 34 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 23102.11 | 13403.10 | 1.00 | 11474.75 | 23090.50 | 34751.25 | 46332.00 | ▇▇▇▇▇ |
| model_year | 0 | 1.00 | 2003.61 | 12.19 | 1984.00 | 1992.00 | 2005.00 | 2015.00 | 2023.00 | ▇▆▆▇▇ |
| estimated_Annual_Petrolum_Consumption_Barrels | 0 | 1.00 | 15.33 | 4.34 | 0.05 | 12.94 | 14.88 | 17.50 | 42.50 | ▁▇▃▁▁ |
| City_MPG_Fuel_Type_1 | 0 | 1.00 | 19.11 | 10.31 | 6.00 | 15.00 | 17.00 | 21.00 | 150.00 | ▇▁▁▁▁ |
| highway_mpg_fuel_type_1 | 0 | 1.00 | 25.16 | 9.40 | 9.00 | 20.00 | 24.00 | 28.00 | 140.00 | ▇▁▁▁▁ |
| combined_MPG_Fuel_Type_1 | 0 | 1.00 | 21.33 | 9.78 | 7.00 | 17.00 | 20.00 | 23.00 | 142.00 | ▇▁▁▁▁ |
| City_MPG_Fuel_Type_2 | 0 | 1.00 | 0.85 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 145.00 | ▇▁▁▁▁ |
| highway_mpg_fuel_type_2 | 0 | 1.00 | 1.00 | 6.55 | 0.00 | 0.00 | 0.00 | 0.00 | 121.00 | ▇▁▁▁▁ |
| combined_MPG_Fuel_Type_2 | 0 | 1.00 | 0.90 | 6.43 | 0.00 | 0.00 | 0.00 | 0.00 | 133.00 | ▇▁▁▁▁ |
| engine_cylinders | 487 | 0.99 | 5.71 | 1.77 | 2.00 | 4.00 | 6.00 | 6.00 | 16.00 | ▇▇▅▁▁ |
| engine_displacement | 485 | 0.99 | 3.28 | 1.36 | 0.00 | 2.20 | 3.00 | 4.20 | 8.40 | ▁▇▅▂▁ |
| time_to_Charge_EV_hours_at_120v_ | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| charge_time_240v | 0 | 1.00 | 0.11 | 1.01 | 0.00 | 0.00 | 0.00 | 0.00 | 15.30 | ▇▁▁▁▁ |
| range_for_EV | 0 | 1.00 | 2.36 | 24.97 | 0.00 | 0.00 | 0.00 | 0.00 | 520.00 | ▇▁▁▁▁ |
| range_ev_city_fuel_type_1 | 0 | 1.00 | 1.62 | 20.89 | 0.00 | 0.00 | 0.00 | 0.00 | 520.80 | ▇▁▁▁▁ |
| range_ev_city_fuel_type_2 | 0 | 1.00 | 0.17 | 2.73 | 0.00 | 0.00 | 0.00 | 0.00 | 135.28 | ▇▁▁▁▁ |
| range_ev_highway_fuel_type_1 | 0 | 1.00 | 1.51 | 19.70 | 0.00 | 0.00 | 0.00 | 0.00 | 520.50 | ▇▁▁▁▁ |
| range_ev_highway_fuel_type_2 | 0 | 1.00 | 0.16 | 2.46 | 0.00 | 0.00 | 0.00 | 0.00 | 114.76 | ▇▁▁▁▁ |
The dataset that we are working with contains approximately 46'000 rows and 26 columns. We can see that most of our features concern the consumption of the cars. In addition, we notice that some variables contain a lot of missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” only contains 0s. We will handle these in the section “data cleaning”.
Exploration of the distribution
Here are more details about the distribution of the numerical features.
Show the code
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
# stat_density() +
# facet_wrap(~variable, scales = "free")
plot_histogram(data)
# Time.to.Charge.EV..hours.at.120v. not appearing because all observations = 0
As most of the models in our dataset are not EVs or hybrid cars, many columns contain a lot of 0s. This issue will be tackled in the data cleaning section. Also, some features are discrete numerical variables, as we can see on the plot of the column “Engine Cylinders”.
Outliers Detection
For each of our numerical columns, let's check the outliers per feature thanks to boxplots.
Show the code
#tentative boxplots
data_long <- data %>%
select_if(is.numeric) %>%
pivot_longer(cols = c("ID",
"model_year",
"estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
"highway_mpg_fuel_type_1",
"combined_MPG_Fuel_Type_1",
"City_MPG_Fuel_Type_2",
"highway_mpg_fuel_type_2",
"combined_MPG_Fuel_Type_2",
"time_to_Charge_EV_hours_at_120v_",
"charge_time_240v",
"range_for_EV",
"range_ev_city_fuel_type_1",
"range_ev_city_fuel_type_2",
"range_ev_highway_fuel_type_1",
"range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.size = 0.5) + # Make outlier points smaller
facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
theme_minimal() +
theme(legend.position = "none", # Hide the legend
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) + # Rotate x-axis labels
labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")
Most of our boxplots show extreme values. Again, this is due to the small number of EV and hybrid cars in our dataset compared to the rest of the models, and to the nature of some features, which only concern those types of vehicles.
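As a side note, the outliers flagged by geom_boxplot() are the points beyond the 1.5 × IQR whiskers. The following sketch (not code from the report; the toy data frame is hypothetical) shows how the same rule can be applied to count extreme values per numeric column:

```r
# Illustrative sketch: counting values outside the 1.5 * IQR whiskers,
# the same rule geom_boxplot() uses to flag outliers.
# `toy` is a hypothetical stand-in for the numeric columns of `data`.
toy <- data.frame(
  highway_mpg = c(20, 22, 24, 25, 26, 28, 30, 140),  # one extreme EV-like value
  cylinders   = c(4, 4, 6, 6, 6, 8, 8, 4)
)

count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

sapply(toy, count_outliers)  # highway_mpg: 1, cylinders: 0
```

Applied to the real dataset, this would confirm that the extreme counts are concentrated in the MPG and EV-range columns.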
Show the code
#Now
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#number of cars per brand / models per make
Now let's check how many models per make we have in our dataset. In order to have a clear plot, we decided to keep only the top 20 brands on the graph. All the remaining makes are accessible in the table just below.
Show the code
#Number of occurences/model per make
nb_model_per_make <- data %>%
group_by(make, model) %>%
summarise(Number = n(), .groups = 'drop') %>%
group_by(make) %>%
summarise(Models_Per_Make = n(), .groups = 'drop') %>%
arrange(desc(Models_Per_Make))
#makes with only 1 model
only_one_model_cars <- nb_model_per_make %>%
filter(Models_Per_Make == 1 )
#table globale
datatable(nb_model_per_make,
rownames = FALSE,
options = list(pageLength = 10,
class = "hover",
searchHighlight = TRUE))
Show the code
#table only 1 model
datatable(only_one_model_cars,
rownames = FALSE,
options = list(pageLength = 10,
class = "hover",
searchHighlight = TRUE))
Show the code
# Option to limit to top 20 makes for better readability
top_n_makes <- nb_model_per_make %>% top_n(20, Models_Per_Make)
# Reordering the Make variable within the plotting code to make it ordered by Models_Per_Make descending
# nb_model_per_make$Make <- factor(nb_model_per_make$Make, levels = nb_model_per_make$Make[order(-nb_model_per_make$Models_Per_Make)])
Show the code
ggplot(top_n_makes, aes(x = reorder(make, Models_Per_Make), y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Models per Make (Top 20)",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability
Show the code
ggplot(only_one_model_cars, aes(x = make, y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Makes with only 1 model",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability
We can see that Mercedes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with imbalanced classes. We therefore need to be careful when making predictions, as we may encounter a bias toward these two majority classes. A few techniques can be used to deal with this problem, such as resampling, ensemble methods (random forests, boosting) and tuning the probability threshold.
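As an illustration of the resampling option, here is a minimal downsampling sketch (not code from the report; the toy data frame and class sizes are hypothetical), where every class is reduced to the size of the smallest one:

```r
library(dplyr)
set.seed(42)

# Hypothetical imbalanced target: 30 "BMW" rows vs 5 "Lotus" rows
toy <- data.frame(make = c(rep("BMW", 30), rep("Lotus", 5)),
                  mpg  = runif(35, 15, 40))

# Downsample every class to the size of the smallest one
n_min <- min(table(toy$make))
balanced <- toy %>%
  group_by(make) %>%
  slice_sample(n = n_min) %>%
  ungroup()

table(balanced$make)  # both classes now have 5 rows
```

Downsampling discards information from the majority classes, so on a dataset like ours it would typically be compared against upsampling or class-weighted models before being adopted.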
Correlation matrix for numerical features
Show the code
#Here I encounter some problems with some of the variables.
#check NAs
# colSums(is.na(data))
data_corrplot <- data %>%
select_if(is.numeric)
# Identify constant columns (columns with zero standard deviation)
constant_columns <- sapply(data_corrplot, function(x) sd(x, na.rm = TRUE) == 0)
# Print constant columns for inspection
# print("Constant columns (standard deviation is zero):")
# print(names(data_corrplot)[constant_columns])
# Remove constant columns
data_corrplot <- data_corrplot[, !constant_columns]
# Correlation transformation for plot using complete observations
cor_matrix <- cor(data_corrplot, use = "complete.obs")
# Melt correlation matrix for plotting
cor_melted <- melt(cor_matrix)
# Plot correlation matrix heatmap using ggplot2
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
axis.text.y = element_text(size = 8)) +
coord_fixed() +
labs(x = '', y = '', title = 'Correlation Matrix Heatmap')
3 Data cleaning
In this section we will handle the missing values of our dataset to make sure that we have a clean dataset for our EDA and modeling. We will first visualize the missing values in the dataset, then clean them in the columns used for our analysis, and finally remove some rows and columns that are not relevant.
Let’s have a look at the entire dataset and its missing values in grey.
We can see that overall we do not have many missing values in proportion to the size of our dataset. However, some columns have a lot of them. Below is the detail of the percentage of missing values by column.
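The per-column percentages can be computed with a one-liner; here is a minimal sketch (the toy data frame is hypothetical, in the report this would run on `data`):

```r
# Sketch: percentage of missing values per column, sorted descending
# `toy` is a hypothetical stand-in with a few NA patterns
toy <- data.frame(
  fuel_type_2 = c(NA, NA, NA, "Electricity"),
  drive       = c("Front-Wheel Drive", NA, "Rear-Wheel Drive", "Rear-Wheel Drive"),
  make        = c("BMW", "Ford", "BMW", "Tesla")
)

pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
round(pct_missing, 1)  # fuel_type_2: 75, drive: 25, make: 0
```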
3.1 Dealing with the columns Engine Cylinders and Engine Displacement
As we can see, we have missing values in 6 columns. Let's first have a closer look at the engine cylinders and engine displacement columns: they both have 484 missing values. After some data manipulation, we see that all the electric vehicles have missing values in these two columns. Given that our dataset contains exactly 484 electric vehicles, we know that the missing values in these columns only concern electric vehicles. This makes sense since electric vehicles do not have a combustion engine, and therefore these categories are not really applicable. We will therefore replace all missing values in these two columns with “none”.
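A minimal sketch of this replacement (not the exact code we used; the toy data frame is hypothetical), assuming the column is recoded as character so that the level "none" can be stored:

```r
# Sketch: recode a numeric engine column so NA (electric vehicles)
# becomes the explicit level "none"
toy <- data.frame(
  fuel_type_1      = c("Regular Gasoline", "Electricity", "Premium Gasoline"),
  engine_cylinders = c(4, NA, 8)
)

toy$engine_cylinders <- as.character(toy$engine_cylinders)
toy$engine_cylinders[is.na(toy$engine_cylinders)] <- "none"

toy$engine_cylinders  # "4" "none" "8"
```

The same recoding applies to engine displacement. The side effect is that these columns become categorical, which the later models have to take into account.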
As we can see, we still have some missing values in the columns “Fuel Type 2”, “Engine Description”, “Drive” and “Transmission”. Let's investigate the missing values in the column “Drive”.
3.2 Dealing with the column Drive, Transmission and Engine Description
We decided to drop the brands with more than 10% of missing values in the “Drive” column. After this operation, we also removed the 8 remaining observations with missing values in the “Transmission” column. We decided to drop the “Engine Description” column since it contains missing values for more than a third of our observations.
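The brand-level filtering can be sketched as follows (not the exact code we used; the toy data frame and the brand names are hypothetical):

```r
library(dplyr)

# Toy data: brand "A" has 50% missing Drive values, brand "B" only 10%
toy <- data.frame(
  make  = c(rep("A", 10), rep("B", 10)),
  drive = c(rep(NA, 5), rep("4WD", 5), rep("2WD", 9), NA)
)

# Brands whose share of missing "drive" values exceeds 10%
brands_to_drop <- toy %>%
  group_by(make) %>%
  summarise(pct_missing_drive = mean(is.na(drive)) * 100, .groups = "drop") %>%
  filter(pct_missing_drive > 10) %>%
  pull(make)

cleaned <- toy %>% filter(!make %in% brands_to_drop)
```

Here brand "A" is dropped entirely while brand "B" is kept; its single remaining NA would then be removed row-wise, as we did for “Transmission”.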
3.3 Final dataset
The dataset is now cleaned and does not contain any missing values. It contains 42240 observations, 18 features and 129 brands. We renamed the columns and stored the result in a csv file (data_cleaned.csv). However, for some models, we need to tackle the imbalanced classes in the target variable. For this reason we also created a new csv file in which we drop the makes with fewer than 10 models (data_cleaned_reduced.csv). This dataset contains 42061 observations, 18 features and 66 brands.
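The reduction step can be sketched like this (an illustration, not the exact code we used; the toy data frame and make names are hypothetical):

```r
library(dplyr)

# Toy data: make "X" has 2 distinct models, make "Y" has 10
toy <- data.frame(
  make  = c(rep("X", 3), rep("Y", 12)),
  model = c("m1", "m1", "m2", paste0("y", c(1:10, 1, 2)))
)

# Keep only the makes with at least 10 distinct models
keep_makes <- toy %>%
  group_by(make) %>%
  summarise(n_models = n_distinct(model), .groups = "drop") %>%
  filter(n_models >= 10) %>%
  pull(make)

reduced <- toy %>% filter(make %in% keep_makes)
```

In this toy example only make "Y" survives the filter; on the real data this is what brings the number of brands from 129 down to 66.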
Here are the cleaned datasets on which we will work from now on.